216 research outputs found
Rank, select and access in grammar-compressed strings
Given a string of length on a fixed alphabet of symbols, a
grammar compressor produces a context-free grammar of size that
generates and only . In this paper we describe data structures to
support the following operations on a grammar-compressed string:
\mbox{rank}_c(S,i) (return the number of occurrences of symbol before
position in ); \mbox{select}_c(S,i) (return the position of the th
occurrence of in ); and \mbox{access}(S,i,j) (return substring
). For rank and select we describe data structures of size
bits that support the two operations in time. We
propose another structure that uses
bits and that supports the two queries in , where
is an arbitrary constant. To our knowledge, we are the first to
study the asymptotic complexity of rank and select in the grammar-compressed
setting, and we provide a hardness result showing that significantly improving
the bounds we achieve would imply a major breakthrough on a hard
graph-theoretical problem. Our main result for access is a method that requires
bits of space and time to extract
consecutive symbols from . Alternatively, we can achieve query time using bits of space. This matches a lower bound stated by Verbin
and Yu for strings where is polynomially related to .Comment: 16 page
Lempel-Ziv Parsing in External Memory
For decades, computing the LZ factorization (or LZ77 parsing) of a string has
been a requisite and computationally intensive step in many diverse
applications, including text indexing and data compression. Many algorithms for
LZ77 parsing have been discovered over the years; however, despite the
increasing need to apply LZ77 to massive data sets, no algorithm to date scales
to inputs that exceed the size of internal memory. In this paper we describe
the first algorithm for computing the LZ77 parsing in external memory. Our
algorithm is fast in practice and will allow the next generation of text
indexes to be realised for massive strings and string collections.Comment: 10 page
Searching and Indexing Genomic Databases via Kernelization
The rapid advance of DNA sequencing technologies has yielded databases of
thousands of genomes. To search and index these databases effectively, it is
important that we take advantage of the similarity between those genomes.
Several authors have recently suggested searching or indexing only one
reference genome and the parts of the other genomes where they differ. In this
paper we survey the twenty-year history of this idea and discuss its relation
to kernelization in parameterized complexity
- …